Online Clustering of Linguistic Data
Abstract
Clustering text data online, as it arrives, is a difficult problem. It is hard both to capture a meaningful notion of linguistic similarity and to cluster large amounts of data in a single pass. The problem is especially challenging because most known algorithms that ensure tight clusterings are inefficient on large datasets. Although significant work has been done on text clustering, the problem has not been fully explored. In this paper, we discuss previous methods for text clustering and then develop a single-pass text clustering algorithm, designed specifically for clustering news stories but more widely applicable, and examine its empirical behavior. We then analyze some of its key design features and compare them to possible alternatives. Finally, we discuss possibilities for further improvement of our algorithm.
Similar resources
A Linguistic Analysis of the Online Debate on Vaccines and Use of Fora as Information Stations and Confirmation Niche
This study looks at the communication between users concerning health risks, with the aim of exploring their use of fora and assessing whether participants establish a niche with like-minded users during these exchanges. By integrating a corpus linguistic approach with content analysis and multiple studies on computer mediated health discourse, this study analyses the intense attention paid to ...
BotOnus: an online unsupervised method for Botnet detection
Botnets are recognized as one of the most dangerous threats to the Internet infrastructure. They are used for malicious activities such as launching distributed denial of service attacks, sending spam, and leaking personal information. Existing botnet detection methods produce a number of good ideas, but they are far from complete yet, since most of them cannot detect botnets in an early stage ...
Learning Linguistic Descriptors of User Roles in Online Communities
Understanding the ways in which users interact with different online communities is crucial to social network analysis and community maintenance. We present an unsupervised neural model to learn linguistic descriptors for a user’s behavior over time within an online community. We show that the descriptors learned by our model capture the functional roles that users occupy in communities, in con...
Linguistic variables determination using fuzzy clustering
The fuzzy sets defining the linguistic variable values can be seen as a fuzzy partition of the linguistic variable. The membership functions obtained using fuzzy clustering algorithms are defined with respect to the group prototypes, and they cannot be used to define the linguistic variable values. We introduce several criteria to pass from the clustering membership functions to the linguistic ...
A new method for fuzzification of nested dummy variables by fuzzy clustering membership functions and its application in financial economy
In this study, the aim is to propose a new method for fuzzification of nested dummy variables. The fuzzification idea of dummy variables has been acquired from non-linear part of regime switching models in econometrics. In these models, the concept of transfer functions is like the notion of fuzzy membership functions, but no principle or linguistic sentence have been used for inputs. Consequen...
Publication date: 2004